Using Map and Reduce for Querying Distributed XML Data
Abstract
Semi-structured information is often represented in the XML format. Although many databases exist that store semi-structured data efficiently, rapidly growing data volumes demand ever larger databases. Even when secondary storage can hold the data, the execution time of complex queries increases significantly if no suitable indexes are applicable. This is critical when short response times are an essential requirement, as in most real-life database systems. Moreover, once storage limits are reached, the data has to be distributed to ensure availability of the complete data set. To meet this challenge, this thesis presents two approaches to improving query evaluation on large semi-structured data through parallelization. First, we analyze Hadoop and its MapReduce framework as a candidate for our distributed computations; second, we present an alternative implementation that copes with these requirements. We introduce three distribution algorithms usable for XML collections, which serve as the basis for distributing data to a cluster. Furthermore, we present a prototype implementation using a current open source database, BaseX, which serves as the basis for our comprehensive query results.
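The general idea of evaluating a query in parallel over a distributed XML collection can be illustrated with a minimal sketch. It is not the thesis's implementation and uses neither Hadoop nor BaseX: the sample documents, the round-robin `distribute` function, and the names `map_query`, `reduce_counts`, and `run_query` are all hypothetical, chosen only to show a local "map" step per partition followed by a "reduce" step that merges partial results.

```python
from collections import Counter
from functools import reduce
import xml.etree.ElementTree as ET

# Hypothetical sample collection; on a real cluster each document
# would be stored on one of the nodes.
DOCS = [
    "<lib><book lang='en'/><book lang='de'/></lib>",
    "<lib><book lang='en'/></lib>",
    "<lib><book lang='fr'/><book lang='en'/></lib>",
]

def distribute(docs, n_nodes):
    """Round-robin partitioning (an assumed, deliberately simple strategy)."""
    partitions = [[] for _ in range(n_nodes)]
    for i, doc in enumerate(docs):
        partitions[i % n_nodes].append(doc)
    return partitions

def map_query(partition):
    """Map step: evaluate the query locally, counting books per language."""
    counts = Counter()
    for doc in partition:
        root = ET.fromstring(doc)
        for book in root.findall(".//book"):
            counts[book.get("lang")] += 1
    return counts

def reduce_counts(left, right):
    """Reduce step: merge two partial results."""
    return left + right

def run_query(docs, n_nodes=2):
    partitions = distribute(docs, n_nodes)
    # On a cluster, each map_query call would run on a different node.
    partials = [map_query(p) for p in partitions]
    return reduce(reduce_counts, partials, Counter())

result = run_query(DOCS)
```

Because the reduce step is associative and commutative here, the merged result does not depend on how many partitions the collection is split into.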
Similar resources
A Map-Reduce algorithm for querying linked data based on query decomposition into stars
In this paper, we investigate the problem of efficiently querying large amounts of linked data using the Map-Reduce framework. We assume data graphs that are arbitrarily partitioned in the distributed file system. Our technique focuses on decomposing the query posed by the user, which is given in the form of a query graph, into star subqueries. We propose a two-phase, scalable Map-Reduce algorit...
A Dynamic Compressed Accessibility Map for Secure XML Querying and Updating
When fine-grained access control is specified on XML data, an accessibility map is required to determine the accessibility of XML nodes for a specific subject (e.g. user or role) under a specific action (e.g. read or write). In recent years, several research works have sought to reduce the overall storage cost of the accessibility map with rapid determination of accessibility of XML nodes ...
Persistent Storage and Querying of Compressed Xml Documents on the Web
We describe the design and implementation of a Web-based distributed system called TREESTORE, intended for storing compressed XML documents in a relational database. The use of a database is fully portable, requiring minimal changes to application code to substitute one database management system for another. In TREESTORE, compressed XML documents are shredded into a fixed number of relational ...
XML Query Routing in Structured P2P Systems
This paper addresses the problem of data placement, indexing, and querying for large XML data repositories distributed over an existing P2P service infrastructure. Our architecture scales gracefully with network and data sizes, is fully distributed, fault tolerant and self-organizing, and handles complex queries efficiently, even those queries that use full-text search. Our framework for indexing ...
Mapping XML to inverted indexed circular linked lists
Extensible Markup Language (XML) has become the de facto standard for data exchange on the World Wide Web and is widely used in many fields, so efficient methods to manage, store, and query XML data are urgently needed. Traditional methods store XML data in relational databases, taking advantage of mature relational technologies. But it needs to map XML schemas to r...